Abstract

The predictability of errors in deterministic temperature forecasts is
investigated. More precisely, the aim is to issue warnings whenever the
differences between forecast and verification exceed a given threshold. The
warnings are generated by analyzing the output of an ensemble forecast
system in terms of a decision making approach. The quality of the resulting
predictions is evaluated by computing receiver operating characteristics,
the Brier score, and the Ignorance score. Special emphasis is also
given to the question of whether rare events are more predictable.



1 Introduction 

For many practical applications it is not only of interest to have a forecast 
of a meteorological variable, but also to have information about the possible 
deviation of this forecast from the observation. Examples for such applications 
could be the estimation of the expected electricity demand JJJ or the output of 
a wind farm. In both cases, a forecast error estimate would be helpful to reveal 
possible risk exposures. Various measures of the forecast skill provide averaged 
estimates of forecast accuracy. However, at a given time the instantaneous 
accuracy might deviate from the averaged forecast skill, since the former depends 
on the present state of the atmosphere. 

We are interested in predicting large instantaneous differences between a
deterministic forecast and the corresponding observation (verification).
Predictions of these differences are obtained by post-processing the output of an
ensemble forecast system in terms of a classification or decision making approach.
More precisely, we analyze the present state of the ensemble forecast system in
order to decide whether a large difference between forecast and verification is
impending and therefore a warning should be issued. This approach is based
on the assumption that the ensemble reflects the uncertainty about the future
state of the atmosphere. In the simplest case, the ensemble can be thought of as
a collection of equally likely scenarios of the atmosphere's future development.
Operational ensembles show deviations from this clearly idealized behavior [5].

The idea of the decision making approach to failures of point forecasts is 
related to predictions made through the identification of precursory structures 
in time series. Observations are compared with structures that are believed to 
be relevant precursors for an event that is expected to occur in the near future. 
Precursor based forecasts are typically used in situations that do not allow for a 
modeling of the system under study but provide a time series record of the past. 
Typical examples for predictions through precursory structures are earthquakes 
[TI] , epileptic seizures [13] and predictions of turbulent wind gusts [TO] . 

In this contribution we combine precursor based predictions with deterministic
forecasts produced by dynamical atmospheric models. While the dynamic
model issues deterministic forecasts of a meteorological variable, the precursor
based approach allows us to predict possible failures of these point forecasts. In
this context, we treat the verification, the dynamic model and the corresponding
ensemble forecasts as a multivariate time series. Loosely speaking, the precursors
we are looking for live in the subset of the multivariate time series spanned
by the output of the ensemble forecast system and the high resolution forecast.
The events we are aiming to predict live in another subset consisting of the high
resolution forecast and the time series of observations. The dependence between
event and precursor can then be understood as a consequence of the fact that
the high resolution forecast and the ensemble forecasts are supposed to describe
the same state of the atmosphere.

In the context of precursor based predictions it has been observed that the
quality of the predictions can display a strong dependence on the event
magnitude [TOJ [S]. In previous work we studied this dependence of the prediction
quality on the event magnitude in more detail by predicting events in
one-dimensional stochastic processes, as well as wind speed recordings [7].
Consequently, we are now not only interested in predicting large deviations of the
forecast but also in studying the dependence of these predictions on the
threshold that is used to define the deviations.

In Sec. 2 we specify the properties of the data record used for this study. In
Sec. 3 we define the events of interest and analyze their occurrence in the data
set. In Sec. 4 we introduce two different strategies to identify
suitable precursors, and develop a corresponding setting for decision making.
In Sec. 5 we use these strategies to issue warnings and analyze the quality of
these predictions by computing receiver operating characteristics, Brier scores,
and Ignorance scores. Sec. 6 is devoted to understanding the relation between
the quality of the predictions and the magnitude of the events under study. We
summarize in Sec. 7.

2 The Data 

The forecasts and the corresponding verifications used in this study were
provided by the European Centre for Medium-Range Weather Forecasts (ECMWF).

Figure 1: The observation (red/light gray), the high resolution forecast
(blue/dark gray), and the ensemble (gray patch) over time. The entire data
set comprises five years. In this plot, all data has been down-sampled to a point
every 30 days to facilitate visualization. For the same reason, rather than
showing all individual ensemble members, the entire range of the ensembles is shown
as a patch.



The data contains the ECMWF's operational medium range deterministic forecast,
run on $T_L$799L91 [TB], to which we will refer in the following as the high
resolution forecast. Furthermore, the ECMWF's ensemble forecast
[TB] consists of 50 ensemble members, generated at $T_L$255L40 resolution with
perturbed initial conditions, plus one member which evolves the unperturbed
initial conditions, the control. We interpret the control as an additional ensemble
member; thus the total number of ensemble members is $M = 51$. Each data
set consists of five subsets. Four of them correspond to four neighboring grid
points of the circulation model, and the fifth is obtained by interpolation of the
four surrounding data sets. For all following considerations we use this
interpolated data set, and we focus on the temperature forecast for London Heathrow
airport, 51°29'N 000°27'W, at noon, with a lead time of 120h. However, a
preliminary study suggested that one could obtain qualitatively similar results for
other lead times. Fig. 1 shows the data set under study, including the high
resolution temperature forecast, the ensemble forecasts and the observation. The data
covers the years from 2001 to 2005, comprising $N = 1814$ data points.

In the following, $h = \{h_n\}$ denotes the time series generated by the high
resolution temperature forecast issued at time $t_n = t_0 + n\Delta t$, with
$n = 1, \ldots, N$ and the time step $\Delta t$ being one day. For the data set
under study the number of time instances is $N = 1814$, starting on
$t_0$ = 1 January 2001, 12:00 UTC. The corresponding time series of ensemble
forecasts are denoted by $x^i = \{x_n^i\}$, with $i = 1, 2, \ldots, M$ referring to
each ensemble member and $n$ specifying time, as above. Analogously, the time
series of verifications is denoted by $y = \{y_n\}$.

Figure 2: Probability to find events of magnitude $\eta$ in the given data set. The
red symbols denote the probability distribution obtained from the original data
set, the blue symbols indicate the probability distributions obtained from
re-sampled data sets generated by drawing with repetition.



3 Characterizing the Events of Interest 



We are interested in forecasting events consisting of deviations of the high
resolution forecast $h_n$ from the verification $y_n$ that exceed a given event
magnitude $\eta$. More precisely, we define the observation variable $\chi_n(\eta)$
for the events of interest in the following way

$$\chi_n(\eta) = \begin{cases} 1 & \text{if } |y_n - h_n| > \eta, \\ 0 & \text{if } |y_n - h_n| \le \eta, \end{cases} \tag{1}$$

where $\chi_n(\eta) = 1$ indicates an event at time step $t_n$, and $\chi_n(\eta) = 0$
consequently describes the absence of an event at $t_n$. Hence, we obtain an
additional time series $\chi(\eta) = \{\chi_n(\eta)\}$ that keeps track of the
occurrence of events.
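As an illustration of this definition, the event series can be computed directly from the forecast and verification series. The following is a minimal sketch, assuming that "exceed" means a strict inequality; the function name `event_series` and the toy temperatures are hypothetical and not taken from the ECMWF record.

```python
def event_series(y, h, eta):
    """Return chi_n(eta): 1 where |y_n - h_n| exceeds eta, else 0.

    y   : verifications (observed temperatures, in Kelvin)
    h   : high resolution forecasts for the same time steps
    eta : event magnitude in Kelvin (strict '>' is an assumption)
    """
    return [1 if abs(yn - hn) > eta else 0 for yn, hn in zip(y, h)]


# Hypothetical toy values, for illustration only.
y = [280.0, 283.5, 279.0, 285.0]
h = [281.0, 280.0, 279.5, 291.0]
chi = event_series(y, h, eta=3.0)
print(chi)  # [0, 1, 0, 1]
```

Raising `eta` can only turn events into non-events, so the event series becomes sparser as the magnitude grows.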

The magnitude of an event is often measured in multiples of the standard
deviation of the time series under study. Since, in the context of weather
forecasting, it is more relevant to measure the absolute difference between a
predicted temperature and the observed temperature, we prefer in this contribution
to measure $\eta$ in absolute values, i.e., in Kelvin.

With respect to an analysis of extreme and rare events, it is useful to start
with an overview of the range of the values and estimates of the first moments
for all relevant subsets of the multivariate time series.
Hence, the largest observed event $|y_n - h_n| = 12.35$ K has magnitude 7.67
times the standard deviation of the corresponding time series. Whether it is
justified to call an event extreme if it is about 7 times larger than the standard
deviation depends on the underlying distribution. While a $5\sigma$ event occurs only
once within a Gaussian distributed data set of 10^ i.i.d. random numbers, one
can observe 1000 events larger than 18 times the standard deviation in a power
law distributed i.i.d. data set of equal size. The distribution of the events
under study can be described reasonably well by a stretched exponential
function, as shown in Fig. 2. In addition to the distribution estimated from the
original data set, we also evaluate distributions within 20 sample data sets of
equal length created by drawing with repetition from the original data set.
This bootstrap method reflects the robustness of the estimated distribution
towards small changes in the composition of the data set, which become especially
prominent in the case of rare events. Extrapolating the relative frequency found
in the data set under study suggests that one can expect to find about 10000
events of size 7.7 times $\sigma$ in a record of length 10^. Hence the largest event we
observe within the limited size of the data set is not that rare for an exponential
distribution. However, it is larger and can be expected to occur more often than
the largest event one would expect if the deviations $\{|y_n - h_n|\}$ were assumed
to be Gaussian distributed.

Figure 3: The observation (line) and the high resolution forecast (dashed line)
are shown for four consecutive days, along with the ensemble (gray dots). In this
lead shot plot, the scattering of the ensemble members along the abscissa has no
significance and is supposed to allow for a better visualization. The deviation
$y_n - h_n$ is marked at day No. 405. Also at this day, all ensemble members with
a distance of more than 2 degrees from the high resolution forecast are represented
by green bold dots. There are 23 such ensemble members. Hence, the precursory
variable $v_n$ has the value $v_n = 23/M$.

4 Identification of the Precursor 

In this contribution we use two (of many possible) strategies to identify
precursory patterns that could announce large differences between high resolution
forecast and verification. In both cases, we assume that useful precursory
patterns can be found by investigating differences of the high resolution
forecast and the corresponding ensemble forecasts. More specifically, as precursory
variable we use the relative number $v_n$ of ensemble members that display a
difference to the high resolution forecast which is larger than a specified
magnitude $\beta$, i.e.,

$$v_n = \frac{\#\{\, j : |x_n^j - h_n| > \beta \,\}}{M}. \tag{2}$$

Here, $x_n^j$ denotes the value of the $j$-th ensemble member at time $t_n$ and $M$ is the
total number of ensemble members. Fig. 3 illustrates this definition of $v_n$. One
might consider choosing the threshold $\beta$ as a function of the event magnitude.
However, the empirical study presented in this contribution shows that even
the simplest choice for $\beta(\eta)$, i.e., $\beta = \eta$, allows us to make reasonable
predictions. Having defined $v_n$, we can derive a secondary variable $p_n(v_n)$ that
is then used for deciding whether or not to give an alarm for an event. We use two
different strategies to define this secondary variable, namely the variable $v_n$ itself
and an approach based on maximizing a conditional probability distribution
function (CPDF). Both approaches will be explained in detail in the later part
of this section. According to these strategies for decision making,

$$p_n := \begin{cases} p(\chi_n(\eta) = 1 \mid v_n), & \text{CPDF method;} \quad \text{(3a)} \\ v_n, & \text{counting method.} \quad \text{(3b)} \end{cases}$$

Here, $p(\chi_n(\eta) = 1 \mid v_n)$ denotes the CPDF for finding an event if a certain
value of $v_n$ is observed. We then give an alarm for an event $\chi_n(\eta)$ to occur at
a given time step if the present value of $p_n$ is larger than or equal to a given
threshold $\delta$. This announcement of an alarm, based on the present value of
$v_n(\eta)$, is reflected by a binary decision function

$$A(p_n, \delta) = \begin{cases} 1 & \text{if } p_n \ge \delta, \text{ with } \delta \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$



The choice of the threshold $\delta$ reflects the tolerance towards deviation of the
observable $p_n$ from the maximum value that $p_n$ can assume, i.e., unity. As
$\delta \to 1$, values of $p_n$ that lead to an alarm are either close to the maximum of the
CPDF or the maximum of the relative number of deviating ensemble members.
Consequently, small values of $\delta$ correspond to frequent alarms and also lead to
a high number of false alarms.
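The counting precursor and the decision function can be sketched in a few lines. This is a minimal sketch under the definitions given in this section; the function names `precursor` and `alarm` and the five-member toy ensemble are hypothetical, chosen only for illustration.

```python
def precursor(ensemble, h_n, beta):
    """Fraction v_n of ensemble members with |x_n^j - h_n| > beta."""
    deviating = sum(1 for x in ensemble if abs(x - h_n) > beta)
    return deviating / len(ensemble)


def alarm(p_n, delta):
    """Decision function A(p_n, delta): 1 if p_n >= delta, else 0."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return 1 if p_n >= delta else 0


# Hypothetical ensemble with M = 5 members (the study uses M = 51).
ensemble = [278.2, 279.9, 283.4, 284.1, 280.5]
v_n = precursor(ensemble, h_n=280.0, beta=2.0)
print(v_n)                     # 0.4, two of five members deviate by more than 2 K
print(alarm(v_n, delta=0.3))   # 1, a tolerant threshold raises an alarm
print(alarm(v_n, delta=0.5))   # 0, a stricter threshold does not
```

Lowering `delta` can only add alarms, which is why small thresholds increase both the hit rate and the false alarm rate.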

The first method of defining $p_n$, as introduced in Eq. (3a), is based on
maximizing $p(\chi_n(\eta) = 1 \mid v_n)$. This approach corresponds to the so-called naive
Bayesian classifier. For the numerical estimates of the CPDFs the numbers
of bins are chosen with respect to the various measures for the quality of the
predictions that will be introduced in the following section. In more detail,
the number of bins $b$ is chosen such that we observe the respective score to
be optimal in the regime where sufficiently many events are available, i.e., for
small and intermediate values of $\eta$. Fig. 4 shows examples for the estimates of
the CPDF used for the generation of Brier scores and ignorance scores ($b = 12$
in both cases) and for the generation of ROC-curves ($b = 26$). Using more bins
increases the specificity and hence improves the resulting ROC curves. On the
other hand, increasing the number of bins leads to an increased variance of the
estimated CPDFs, especially in the limit of very few events, i.e., large values of
$\eta$. We test for this increase in variance through cross-validation.
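A histogram estimate of the CPDF can be sketched as follows. This is only a sketch: equal-width bins on the interval [0, 1] are an assumption, since the text specifies the number of bins but not their placement, and the function name `estimate_cpdf` is hypothetical.

```python
def estimate_cpdf(v, chi, n_bins):
    """Histogram estimate of p(chi = 1 | v) on n_bins equal-width bins of [0, 1].

    v   : precursor values v_n in [0, 1]
    chi : event indicators chi_n (0 or 1)
    Returns one estimate per bin, or None for bins without any entries.
    """
    counts = [0] * n_bins  # N_k: number of entries in bin k
    events = [0] * n_bins  # M_k: entries in bin k that coincide with an event
    for vn, cn in zip(v, chi):
        k = min(int(vn * n_bins), n_bins - 1)  # v_n = 1 falls into the last bin
        counts[k] += 1
        events[k] += cn
    return [m / n if n > 0 else None for m, n in zip(events, counts)]


# Hypothetical toy data with two bins.
v = [0.05, 0.1, 0.6, 0.7, 0.9, 1.0]
chi = [0, 0, 0, 1, 1, 1]
print(estimate_cpdf(v, chi, n_bins=2))  # [0.0, 0.75]
```

With more bins each estimate rests on fewer entries, which is the variance increase described above.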
Figure 4: Numerical estimates of the CPDF $p(\chi_n(\eta) = 1 \mid v_n)$ as they are used for
the computation of the Brier score, the ignorance and ROC-curves. We chose
to work with the number of bins $b$ that generate optimal scores, i.e., $b = 12$
for the computation of the Brier score and the ignorance and $b = 26$ for the
computation of ROC-curves.



The second method of identifying suitable precursory structures is based on
the assumption that the ensemble members represent equally likely scenarios of
the future evolution of the atmosphere. Hence a large ensemble spread reflects
a state of the atmosphere in which forecasts are difficult to make. In other
words, if the high resolution forecast differs significantly from the verification, we
assume that this failure of the forecast is due to an increased sensitivity to small
perturbations in the initial conditions. We can then hope that this sensitivity
is not only present in the atmosphere, but also reflected by the ECMWF's
circulation model. Consequently, we would expect a large number of ensemble
members to deviate from the high resolution forecast.

The estimates of the CPDFs, as shown in Fig. 4, indicate that we can assume
$p(\chi_n(\eta) = 1 \mid v_n)$ to be a monotonic function (if we attribute the fluctuations to
finite sample effects). Consequently, the values of $v_n$ that lead to an alarm for an
extreme event in terms of the counting approach will in many cases also generate
an alarm in the CPDF approach. Remembering that $p(\chi_n(\eta) = 1 \mid v_n)$ are the
observed frequencies of events, given $v_n$, and interpreting $v_n$ as the forecast
probabilities for an event, we can even think of the plots in Fig. 4 in terms of a
reliability diagram [14]. If the original ensemble and the verification $y_n$ were
independent draws from the same distribution, the curves should coincide with
the diagonal. For very large and very small events, we observe deviations from
the diagonal that can be attributed either to the limited amount of available
data in these regimes or to systematic deviations.
Figure 5: ROC-curves generated through estimates of conditional probability
distributions, i.e., $p_n = p(\chi_n(\eta) = 1 \mid v_n)$, and leave-one-out cross-validation. As
in Fig. 6, the lines without symbols represent ROC-curves computed within
re-sampled data sets.

5 Evaluating the Quality of the Predictions 

In the following sections we present the results for the predictions made
according to the CPDF method and the counting method as specified in Sec. 4. In
order to evaluate the quality of the predictions we use different measures, namely
the ROC-curve [3], the Brier score [1], and the Ignorance [17]. In-sample
predictions were made for the lead times 24h, 48h, 96h, 144h, 168h, 192h, 216h,
240h. Since we do not find qualitatively different results for different lead times
(with the exception of 24h) we restrict ourselves to the discussion and further
investigation of the forecasts with a lead time of 120h. To support the validity
of the results obtained from our relatively small sample (1814 data points) we
have to consider different sources of uncertainty: the robustness of our results
towards small changes in the specific composition of the data set and the
influence of over-fitting due to in-sample prediction. The latter effect is only an issue
for the CPDF method, since no training is needed to identify the precursor for
the counting method.

Figure 6: ROC-curves for the prediction of differences between a high resolution
forecast and its verification. The ROC-curves were created using the number
of deviating ensemble members as a predictor, i.e., $p_n = v_n$. This method is
introduced as counting method in Sec. 4. The points represent the original data
set, the lines represent 20 bootstrap samples.

We test for the robustness towards small changes in the composition of the
sample by re-sampling the data set (bootstrap method). The re-sampling is
done by creating 20 test data sets by drawing with repetition from the original
data set and applying the same training and prediction algorithm. Using the
re-sampled data sets leads to a slightly different estimate of the probability
distribution. In addition to the results obtained from the original data set, we
hence obtain a distribution of results based on the re-sampled data sets, which
can serve as an estimate of the variance of the original results. The random
generator used for the re-sampling is the Mersenne twister [T^], as implemented
in the GNU Scientific Library [3].
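The re-sampling step can be sketched as follows. This sketch does not use the GNU Scientific Library; it substitutes Python's standard `random` module, whose core generator is also a Mersenne Twister. The function name `bootstrap_event_rates`, the seed, and the choice of the event frequency as the re-computed statistic are assumptions made for illustration.

```python
import random


def bootstrap_event_rates(chi, n_resamples=20, seed=0):
    """Relative event frequency in n_resamples bootstrap copies of chi.

    Each copy has the same length as chi and is created by drawing
    time steps with repetition from the original series.
    """
    rng = random.Random(seed)  # Mersenne Twister based generator
    n = len(chi)
    rates = []
    for _ in range(n_resamples):
        sample = [chi[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    return rates


# Hypothetical event series with a relative event frequency of 0.2.
chi = [1, 0, 0, 0, 0] * 40
rates = bootstrap_event_rates(chi)
print(min(rates), max(rates))  # spread of the re-sampled frequencies
```

In the full procedure the same indices would be used to re-sample forecasts, ensemble and verification jointly, so that precursor and event stay paired.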

In order to test for the effects of in-sample prediction, we repeat the
predictions doing leave-one-out cross validation (also called total cross validation).
We therefore train on all but one data point and predict the occurrence or
non-occurrence of an event in the left out time step. In the context of maximum
CPDF estimation, "training" refers to determining the CPDF. This procedure
(training and prediction) is repeated $N$ times, with $N$ being the number of
data points and a different data point left out for each repetition. Since the counting
method needs no training, we apply the leave-one-out cross validation only for
the CPDF method.
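For a histogram estimate of the CPDF, leaving out one point only changes the counts of the bin that point falls into, so the N repetitions need not be carried out explicitly. The following minimal sketch uses this shortcut; equal-width bins on [0, 1] and the function name `loo_cpdf_predictions` are assumptions.

```python
def loo_cpdf_predictions(v, chi, n_bins):
    """Leave-one-out estimate of p(chi_n = 1 | v_n) for every time step n.

    The CPDF for point n is estimated from all other points in the same bin.
    Returns None where a point is the only entry of its bin.
    """
    bins = [min(int(vn * n_bins), n_bins - 1) for vn in v]
    counts = [0] * n_bins  # entries per bin
    events = [0] * n_bins  # events per bin
    for k, cn in zip(bins, chi):
        counts[k] += 1
        events[k] += cn
    predictions = []
    for k, cn in zip(bins, chi):
        n_rest = counts[k] - 1   # bin population without point n
        m_rest = events[k] - cn  # events in the bin without point n
        predictions.append(m_rest / n_rest if n_rest > 0 else None)
    return predictions


# Hypothetical toy data with two bins.
v = [0.1, 0.2, 0.3, 0.8, 0.9]
chi = [0, 0, 1, 1, 1]
print(loo_cpdf_predictions(v, chi, n_bins=2))  # [0.5, 0.5, 0.0, 1.0, 1.0]
```

The in-sample estimate for the first bin would be 1/3 for all three points; leaving each point out pushes the prediction away from the outcome actually observed at that point.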

5.1 ROC and AUC

Figure 7: Comparison of counting and CPDF method.

A common method to evaluate the success of a classification task is the receiver
operating characteristic curve (ROC-curve) [51 [3]. We first compute the rate
$r_c$ of correctly predicted events (hit rate, rate of true positives, sensitivity) and
the rate $r_f$ of false alarms (rate of false positives, 1 - specificity). A ROC-curve
comprises a plot of $r_c$ against $r_f$, as e.g. in Figs. 5 and 6. Numerically, these
rates can be computed from the time series of the precursory variable $\{v_n\}$ and
the time series of the events $\{\chi_n(\eta)\}$ by simple counting. For each value of
the threshold $\delta$ one obtains a point in the $r_f$-$r_c$ plane. If $\delta$ is assumed to be
a continuous variable one arrives at a curve parametrized by $\delta$. The resulting
curve in the unit square of the $r_f$-$r_c$ plane approaches the origin for $\delta \to 1$ and
the point (1, 1) in the limit $\delta \to 0$. A curve above the diagonal reveals that the
corresponding strategy of prediction is better than a purely random prediction,
which is characterized by a ROC-curve along the diagonal.
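The counting behind a ROC-curve can be sketched as follows. This is a minimal sketch; the function name `roc_points` and the toy values are hypothetical, and the data are assumed to contain at least one event and one non-event.

```python
def roc_points(p, chi, thresholds):
    """Return one (r_f, r_c) pair per threshold delta.

    p          : decision variable p_n for each time step
    chi        : event indicators chi_n (0 or 1)
    thresholds : values of delta in [0, 1]
    """
    n_events = sum(chi)
    n_nonevents = len(chi) - n_events
    points = []
    for delta in thresholds:
        alarms = [1 if pn >= delta else 0 for pn in p]
        hits = sum(a for a, c in zip(alarms, chi) if c == 1)
        false_alarms = sum(a for a, c in zip(alarms, chi) if c == 0)
        points.append((false_alarms / n_nonevents, hits / n_events))
    return points


# Hypothetical toy data.
p = [0.1, 0.4, 0.35, 0.8]
chi = [0, 0, 1, 1]
print(roc_points(p, chi, [0.0, 0.3, 0.5, 1.0]))
# [(1.0, 1.0), (0.5, 1.0), (0.0, 0.5), (0.0, 0.0)]
```

The first and last thresholds reproduce the two limits described above: every time step raises an alarm at a threshold of zero, and none does at a threshold of one unless a value of exactly one occurs.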

ROC-curves that characterize the success of counting method and CPDF
method are presented in Figs. 5 and 6. The lines with symbols represent
ROC-curves generated from the original data set. The lines without symbols
represent ROC-curves computed from 20 re-sampled data sets. As mentioned in the
previous sections, the number of bins used to estimate the CPDF was chosen
with respect to the methods used to quantify the success of the predictions.
Concerning ROC-curves, we found that a finer binning leads to optimal results,
i.e., the number of bins used to estimate CPDFs is 26. Although the corresponding
estimates of conditional probability distributions are not smooth functions
of $v_n$, they produce reasonably good ROC-curves, since the fine binning results
in a high specificity of the observed precursor. One can see that the quality of
the forecasts (as far as it is quantified by the ROC curve) increases with the
magnitudes of the events under study. This result is consistent with the
findings previously observed for precursor based predictions in time series and
for the prediction of precipitation [5]. If we focus on larger events, then the
sensitivity to small variations in the data set under study also increases, as is shown
by the spread of the ROC-curves obtained from the re-sampled data sets. This
is not surprising, since larger events occur less often and hence small deviations
in their frequency of occurrence become more prominent.

Figs. 7 and 8 compare CPDF and counting method, as well as the influence of
the cross validation. The area under the ROC-curve (AUC) is a well established
summary index for ROC-curves, see e.g., pTS] for other summary indices. An
optimal prediction is characterized by an AUC of unity, while random predictions are
reflected by a diagonal in the ROC-plane and hence an AUC of 0.5.
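Given a finite set of points on a ROC-curve, the AUC can be approximated numerically. The sketch below uses the trapezoidal rule, which is an assumption, since the text does not state how the area was computed; the function name `auc` is hypothetical.

```python
def auc(points):
    """Area under a ROC curve given as (r_f, r_c) points, by the trapezoidal rule.

    The end points (0, 0) and (1, 1) are added if they are missing.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


print(auc([]))                        # 0.5, the diagonal of a random prediction
print(auc([(0.0, 1.0)]))              # 1.0, an optimal prediction
print(auc([(0.0, 0.5), (0.5, 1.0)]))  # 0.875
```

Sorting by false alarm rate first and hit rate second keeps the vertical segment at a false alarm rate of zero in the right order for a monotonically rising curve.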

Both the ROC-curves in Fig. 7 and the corresponding AUCs in Fig. 8 show
that the success of counting method and CPDF method does not differ
significantly in the regime of $\eta < 5$. For larger event magnitudes, i.e., in the regime of
very few events, the counting method performs better than the CPDF method.
In this regime, the difference between in-sample prediction and leave-one-out
cross validation becomes more prominent as well. Comparing the AUCs in
Fig. 8 with the scores in Figs. 9 and 10 in the following section, the regime
where one can observe a clear difference between counting method, in-sample
prediction, and total cross-validation starts earlier, that is for smaller values
of $\eta$. This observation can probably be attributed to a higher variance of the
estimates of the conditional probability distribution used to generate AUCs and
ROC-curves, provoked by an increased number of bins.

Figure 8: The area under the ROC-curves for the predictions according to
maximum CPDF and counting method. The two ROC-curves for the CPDF
method correspond to in-sample predictions and leave-one-out cross validation.
Confidence bars were evaluated using re-sampled data sets, as described in
more detail for the generation of confidence bars for scores in Sec. 5.2.

In theory, choosing $p_n = p(\chi_n(\eta) = 1 \mid v_n)$ should be the optimal strategy for
a prediction, if it could be estimated with arbitrary accuracy. However, as we
saw in Figs. 7 and 8, given a finite data set, the counting method leads to very similar
or even better results. The success of the counting method can be understood
by taking into account that the CPDF (as indicated by Fig. 4) appears to be
monotonically increasing with $v_n$. Hence, the maximum of the CPDF can be
expected to be close to large values of $v_n$ that are also considered to be good
precursors in terms of the counting approach. Or vice versa, a value of $v_n$ that
leads to an alarm according to the counting method also leads to an alarm
in terms of the maximum CPDF approach. However, since the counting method is
not based on any training procedure or method of estimation, it is independent of
the number of available events. Consequently, the counting method performs
very similarly to the CPDF approach if the CPDF can be estimated well from the
available number of events, and it leads to better predictions if the estimates
of the CPDF are poor, as e.g. in the case of very large events.
Figure 9: Relative Brier scores for the counting method and the maximum
CPDF approach. In order to test for over-fitting, the CPDFs were additionally
estimated by using leave-one-out cross-validation. The generation of the
confidence bars is explained in more detail in Sec. 5.2.



5.2 Brier Score and Ignorance 

To test whether the results presented in the previous section depend on a
measure for the quality of a prediction, we also compute Brier scores [1] and
ignorance scores [17]. Both scores are common methods to test for qualities of
predictions in the framework of weather forecasting. The Brier score [1] is
defined as

$$B(\chi(\eta), p) = \frac{1}{N} \sum_{n=1}^{N} \left( \chi_n(\eta) - p_n \right)^2, \tag{5}$$

where $p = \{p_n\}$ denotes the time series of successive $p_n$. As defined in Eq. (3),
$p$ is either chosen to be the CPDF $p(\chi_n(\eta) = 1 \mid v_n)$ or the relative number of
ensemble members $v_n(\eta)$. A suitable choice of $p_n$ produces a small value of
$B(\chi(\eta), p)$, since we expect the value of $p_n$ to be approximately unity if an
event is observed ($\chi_n(\eta) = 1$) and to be close to zero otherwise. Relative scores
(skill scores) measure whether a forecast method is better than the forecast
given by the relative frequency $f = p(\chi_n(\eta) = 1)$ of events. Consequently, the
relative Brier score (Brier skill score) is defined as

$$B_{\mathrm{rel}}(\chi(\eta), p, f) = \frac{B(\chi(\eta), f) - B(\chi(\eta), p)}{B(\chi(\eta), f)}, \tag{6}$$

where $B(\chi(\eta), f)$ denotes the Brier score obtained for the relative frequency of
observed events $f$.
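The Brier score and its relative version can be sketched as follows. The relative score is written here as one minus the ratio of the two Brier scores, which is an assumed form chosen so that a perfect forecast yields one and a forecast no better than the relative event frequency yields zero; the function names are hypothetical.

```python
def brier_score(chi, p):
    """Mean squared difference between event indicators and forecast values."""
    return sum((c - pn) ** 2 for c, pn in zip(chi, p)) / len(chi)


def brier_skill_score(chi, p):
    """Relative Brier score with the relative event frequency f as reference.

    Undefined (division by zero) if chi contains only events or only non-events.
    """
    f = sum(chi) / len(chi)
    reference = brier_score(chi, [f] * len(chi))
    return 1.0 - brier_score(chi, p) / reference


# Hypothetical toy data.
chi = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(round(brier_score(chi, p), 6))        # 0.158125
print(round(brier_skill_score(chi, p), 6))  # 0.3675
```

Here the reference score is 0.25, because a constant forecast of 0.5 misses every outcome by one half.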

The ignorance score [17] for a binary event is given by

$$I(\chi(\eta), p) = -\frac{1}{N} \sum_{n=1}^{N} \Bigl[ \log(p_n)\, \chi_n(\eta) + \log(1 - p_n)\, \bigl(1 - \chi_n(\eta)\bigr) \Bigr]. \tag{7}$$

Figure 10: The relative ignorance scores of the decision making strategies
defined in Eq. (3). In order to test for over-fitting, the estimation of the conditional
probability density was repeated with leave-one-out cross-validation. The
generation of confidence bars is explained in more detail in Sec. 5.2.



Similar to the Brier score, a good prediction is characterized by a small value
of the ignorance score. In other words, one aims at minimizing the ignorance.

To make sure that numerical estimates of $p_n$ do not cause a divergent
ignorance, we add two imaginary ensemble members. Of these two imaginary
members, one is always counted as a deviation from the high resolution forecast, while
the second one is assumed to never deviate from the high resolution forecast.
Thus, we always ensure that $\frac{1}{M+2} \le p_n \le \frac{M+1}{M+2}$. This method of
regularization introduces some bias into the estimation of $p_n$, but it is necessary in
order to use the ignorance for the evaluation of yes/no forecasts (binary forecasts).
Analogously to the relative Brier score, the relative ignorance

$$I_{\mathrm{rel}}(\chi(\eta), p, f) = \frac{I(\chi(\eta), f) - I(\chi(\eta), p)}{I(\chi(\eta), f)} \tag{8}$$

compares the ignorance obtained from the predictive distribution $p$ with the
ignorance of the climatology $f$. Both Brier score and ignorance indicate a good
quality of the predictions when their values are small. However, this implies
that the respective relative scores are expected to approach unity for an ideal
forecast and zero for a forecast that is not better than the climatology. In
other words, when comparing relative scores, a larger value of a relative score
indicates a better forecast.
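The regularization and the ignorance score can be sketched as follows. The natural logarithm is an assumption, since the base is not specified; the relative ignorance does not depend on it. The function names are hypothetical, and the relative score is written as one minus the ratio of the two ignorance values.

```python
import math


def regularized_fraction(n_deviating, m):
    """Relative number of deviating members after adding two imaginary members.

    One imaginary member always deviates, the other never does, so the
    result lies between 1/(m + 2) and (m + 1)/(m + 2).
    """
    return (n_deviating + 1) / (m + 2)


def ignorance(chi, p):
    """Ignorance of forecast values p for binary outcomes chi (natural log)."""
    total = sum(math.log(pn) if c == 1 else math.log(1.0 - pn)
                for c, pn in zip(chi, p))
    return -total / len(chi)


def relative_ignorance(chi, p):
    """Relative ignorance with the relative event frequency f as reference."""
    f = sum(chi) / len(chi)
    reference = ignorance(chi, [f] * len(chi))
    return 1.0 - ignorance(chi, p) / reference


m = 51  # number of ensemble members in the study
print(regularized_fraction(0, m), regularized_fraction(m, m))  # 1/53 and 52/53

chi = [0, 1]
p = [0.25, 0.75]
print(relative_ignorance(chi, p))  # positive, better than the reference forecast
```

Without the two imaginary members, a time step with no deviating member followed by an event would contribute the logarithm of zero and the score would diverge.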

The relative scores evaluated for counting and CPDF method are shown in
Figs. 9 and 10. The confidence bars in Figs. 9 and 10 reflect the variance of the
ensemble of re-sampled data sets. Especially for large event magnitudes, not all
re-sampled data sets provide a sufficient number of events for the evaluation of
scores. Hence, if all re-sampled data sets allowed the computation of scores, the
confidence bars were estimated from 20 re-sampled data sets. However, if some of
the re-sampled data sets did not provide estimates of the scores for given values of
$\eta$ and $v_n$, we estimated the confidence bars from a smaller number of re-sampled
data sets. We decided to plot an estimated confidence bar for any given values
of $\eta$ and $v_n$ whenever at least 5 re-sampled data sets provided scores.

Both Brier score and ignorance increase with increasing event
magnitude in the regime $0 < \eta < 3$. For $\eta > 3$ the Brier score is approximately
constant, while the ignorance continues to increase until $\eta \approx 6$. If $\eta > 6$, the
confidence bars of both scores increase significantly, and for the ignorance we
observe in addition a qualitatively different dependence on $\eta$ for counting and
CPDF method. Both effects can probably be attributed to finite sample effects
that become more prominent for larger and thus rarer events.

With respect to the comparison of the two different strategies of defining
$p_n$, the scores show results that are consistent with the ROC and AUC curves
discussed in the previous section. For small values of $\eta$, i.e., in the regime
where many events are available and the uncertainty in the estimation of the
scores is small, the results for the counting method and the CPDF method with
and without cross validation coincide. The increase of the difference between
in-sample CPDF prediction and cross-validated CPDF prediction can be
understood by considering the effect of leave-one-out cross-validation on scores.
By reordering the temporal indices in the summation in Eqs. (5) and (7) one
arrives at the following expressions for the scores $B_{CV}$ and $I_{CV}$ estimated with
leave-one-out cross-validation (CV)


where $N_k$ denotes the number of entries in the bin associated with the value
$v_k$ and $M_k$ the number of entries in bin $k$ that coincide with an event.
According to the expression for $B_{CV}$, the terms that contribute to Brier
scores estimated with leave-one-out cross-validation are larger by a factor of
$N_k^2/(N_k - 1)^2$ than the corresponding terms evaluated on the full data
set. Since $M_k < N_k$, it is easy to see that the ignorance evaluated by
leave-one-out cross-validation is also larger than the ignorance computed on
the whole data set. Consequently one can expect the corresponding relative
scores to be smaller. For both scores the effect of the cross-validation
becomes more prominent in the limit of very few entries in the histogram of the
CPDF, i.e., small values of $N_k$, as we observe for very large values of
$\eta$.
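The factor quoted above can be checked for a single histogram bin. The following is a minimal sketch, assuming that the in-sample forecast probability in bin $k$ is the relative frequency $M_k/N_k$ and that leaving one entry out simply removes it from both counts. The function names and the use of base-2 logarithms for the ignorance are assumptions made for illustration and are not taken from the paper.

```python
import math

def brier_terms(n_k, m_k):
    """Summed squared errors of one bin: (in-sample, leave-one-out).

    n_k: number of entries in the bin (at least 2),
    m_k: number of those entries that coincide with an event.
    Assumes the forecast probability is the relative event frequency.
    """
    p_in = m_k / n_k
    in_sample = m_k * (1.0 - p_in) ** 2 + (n_k - m_k) * p_in ** 2
    # Leave-one-out: the left-out entry is removed from the counts.
    p_event = (m_k - 1) / (n_k - 1)  # forecast if the left-out entry is an event
    p_none = m_k / (n_k - 1)         # forecast if it is a non-event
    loo = m_k * (1.0 - p_event) ** 2 + (n_k - m_k) * p_none ** 2
    return in_sample, loo

def ignorance_terms(n_k, m_k):
    """Summed ignorance of one bin: (in-sample, leave-one-out).

    Requires 2 <= m_k <= n_k - 2 so that no logarithm of zero occurs.
    """
    in_sample = -(m_k * math.log2(m_k / n_k)
                  + (n_k - m_k) * math.log2((n_k - m_k) / n_k))
    loo = -(m_k * math.log2((m_k - 1) / (n_k - 1))
            + (n_k - m_k) * math.log2((n_k - m_k - 1) / (n_k - 1)))
    return in_sample, loo

for n_k, m_k in [(50, 10), (5, 2)]:
    b_in, b_loo = brier_terms(n_k, m_k)
    i_in, i_loo = ignorance_terms(n_k, m_k)
    # The Brier ratio equals N_k^2 / (N_k - 1)^2; the ignorance grows as well.
    print(n_k, m_k, b_loo / b_in, n_k ** 2 / (n_k - 1) ** 2, i_loo > i_in)
```

Under these assumptions the ratio approaches 1 for well-filled bins and grows quickly for sparsely filled ones, which is the regime reached for very large event magnitudes.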

6 Understanding the Dependence on the Event 
Magnitude 

The ROC-curves and scores presented in the previous section indicated a better 
predictability of larger events. Intuitively one could expect these larger events 
to be harder to predict, since they are rare. On the other hand, they are clearly 
distinguishable from the average event, and one can thus expect that they are 
also preceded by a more distinguished precursory signal. 



Figure 11: The numerical estimates of $c(\eta, v_n)$ according to Eq. (12) as a
function of $v_n$ for selected fixed values of $\eta$; confidence bars were
again created using 20 re-sampled data sets and drawn if at least 5 of the 20
data sets provided estimates of $c(\eta, v_n)$.

Figure 12: This plot indicates whether $c(\eta, v_n)$ is positive or negative
for a given pair of coordinates in the $v_n$, $\eta$ plane.



In order to test for the magnitude dependence, we consider again ROC-curves
as a measure for the quality of a prediction. The slope of the ROC-curve
can be identified to be the likelihood ratio (3)

$$\Lambda(v_n, \eta) = \frac{p(v_n \mid \chi_n(\eta) = 1)}{p(v_n \mid \chi_n(\eta) = 0)}. \tag{11}$$



If we understand the likelihood ratio as a function of $v_n$, Eq. (11)
describes a family of functions parametrized by $\eta$. We can then simply
investigate the dependence of the likelihood ratio on the event magnitude by
computing the derivative of $\Lambda(v_n, \eta)$ with respect to the event
magnitude. Rearranging the equation yields the following condition [7],



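$$c(\eta, v_n) = \frac{\partial}{\partial \eta} \ln p(\chi_n(\eta) = 1 \mid v_n) - \frac{1 - p(\chi_n(\eta) = 1 \mid v_n)}{1 - p(\chi_n(\eta) = 1)} \, \frac{\partial}{\partial \eta} \ln p(\chi_n(\eta) = 1). \tag{12}$$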



This expression has the same sign as the derivative of the likelihood ratio, i.e., 

$$\operatorname{sign}\left( \frac{\partial}{\partial \eta} \Lambda(v_n, \eta) \right) = \operatorname{sign}\, c(\eta, v_n). \tag{13}$$
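The step from the likelihood ratio of Eq. (11) to this sign relation can be made explicit. The derivation below is a sketch that assumes only Bayes' theorem and that the distribution of the precursor, $p(v_n)$, does not depend on $\eta$. The shorthand $P_c$ and $P$ is introduced here for readability and is not part of the original notation.

```latex
% Shorthand (introduced here, not in the original):
%   P_c = p(\chi_n(\eta) = 1 \mid v_n),   P = p(\chi_n(\eta) = 1)
\begin{align*}
\Lambda(v_n, \eta)
  &= \frac{p(v_n \mid \chi_n(\eta) = 1)}{p(v_n \mid \chi_n(\eta) = 0)}
   = \frac{P_c}{1 - P_c} \, \frac{1 - P}{P}, \\
\frac{\partial}{\partial \eta} \ln \Lambda(v_n, \eta)
  &= \frac{1}{1 - P_c} \, \frac{\partial}{\partial \eta} \ln P_c
   - \frac{1}{1 - P} \, \frac{\partial}{\partial \eta} \ln P, \\
(1 - P_c) \, \frac{\partial}{\partial \eta} \ln \Lambda(v_n, \eta)
  &= \frac{\partial}{\partial \eta} \ln P_c
   - \frac{1 - P_c}{1 - P} \, \frac{\partial}{\partial \eta} \ln P
   = c(\eta, v_n).
\end{align*}
```

Since $\Lambda > 0$ and $1 - P_c > 0$, the derivative of $\Lambda$ with respect to $\eta$ and $c(\eta, v_n)$ share the same sign.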



Consequently, if $c(\eta, v_n) > 0$, the corresponding families of likelihood
ratios and ROC-curves reveal that the quality of the predictions increases if
one focuses on larger events. If $c(\eta, v_n) < 0$, the corresponding families
of likelihood ratios and ROC-curves show a negative magnitude dependence; in
this case larger events are harder to predict. If $c(\eta, v_n) = 0$,
likelihood ratio and ROC-curves do not depend on the event magnitude. The
results of the numerical estimation of $c(\eta, v_n)$ are presented in Fig. 11.
Note that $c(\eta, v_n)$ does not depend on the choice of the precursor or on
the method of decision making (e.g., maximum CPDF approach or counting), but is
simply a function of $\eta$ and $v_n$. Nevertheless, the computation of
$c(\eta, v_n)$ requires numerical estimates of the CPDF and the climatology, as
well as their derivatives. The derivatives of the estimates of both
distributions are obtained using Savitzky-Golay filtering [18]. As in the
previous sections, confidence bars represent the double standard deviations of
results obtained from 20 re-sampled data sets. However, large events in
particular cannot be expected to be well represented in some of the re-sampled
data sets, and consequently for certain values of $\eta$ and $v_n$ it was not
possible to evaluate the condition for all 20 re-sampled data sets. In order to
use as much of the available information as possible, we decided to plot
confidence bars if at least 5 of the re-sampled data sets could produce
estimates of $c(\eta, v_n)$.
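The rule for drawing confidence bars can be written down compactly. The sketch below assumes that each re-sampled data set either yields one estimate or none (marked by `None`), that the bar is centered on the mean of the valid estimates, and that the sample standard deviation is used. The function name `confidence_bar` and these choices are illustrative assumptions, not details given in the text.

```python
import statistics

def confidence_bar(estimates, min_valid=5):
    """Return (mean, 2 * standard deviation) of the valid estimates, or None.

    estimates: one entry per re-sampled data set; None marks a data set for
    which the quantity could not be estimated.
    min_valid: minimum number of valid estimates required to draw a bar.
    """
    valid = [e for e in estimates if e is not None]
    if len(valid) < min_valid:
        return None  # too few re-sampled data sets provided an estimate
    return statistics.mean(valid), 2.0 * statistics.stdev(valid)

# 20 re-sampled data sets, of which only 6 produced an estimate.
estimates = [0.12, 0.08, None, 0.15, None, 0.10, 0.09, 0.14] + [None] * 12
print(confidence_bar(estimates))
```

With fewer than five valid estimates the function returns `None`, which corresponds to omitting the bar in the plot.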

Since the evaluation of $c(\eta, v_n)$ is based on the numerical estimation of
derivatives of PDFs, finite sample effects become even more prominent than in
the computation of the scores. Consequently, due to the relatively small size
of the data set, $c(\eta, v_n)$ is very noisy (see Fig. 11) compared to
previous results for other prediction tasks that were generated using larger
data sets. Nevertheless, we find that the numerical estimates of
$c(\eta, v_n)$ have positive values for the majority of coordinates in the
$\eta$-$v_n$ plane, as shown in Fig. 12. Strictly speaking, the dependence on
the event magnitude can vary for every point in the $\eta$-$v_n$ plane, since
$c(\eta, v_n)$ is a function of both variables. However, for practical
considerations, one might be interested in the overall behavior, independently
of the specific choice of the precursor or the event threshold. That is why we
also studied whether averages of $c(\eta, v_n)$, i.e.,
$\langle c(\eta, v_n) \rangle_\eta$ and $\langle c(\eta, v_n) \rangle_{v_n}$,
can correctly characterize the overall dependence on the event magnitude
without referring to a specific value of the precursory variable or the event
size. Fig. 13 shows that $\langle c(\eta, v_n) \rangle_\eta$ is positive for
most values of the precursory variable $v_n$. In particular it is positive for
larger values of $v_n$, which are relevant as precursors for large events. The
averages over the precursory variable $v_n$, also displayed in Fig. 13, are
positive for almost all values of $\eta$. The fact that the exceptions occur
within the regimes of very small and very large values of $\eta$ supports the
assumption that we can attribute them to finite sample effects, since in both
regimes the corresponding events are rare.

In total, the evaluation of the condition $c(\eta, v_n)$ and its averages over
$\eta$ or $v_n$ suggests a positive magnitude dependence.
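The relation between the sign of the condition and the slope of the likelihood ratio can be illustrated on a toy model. The sketch below is not based on the data of this study: it assumes three hypothetical precursor values with made-up probabilities `WEIGHTS`, and a CPDF with an exponential tail whose scale `SCALES` depends on the precursor. All names and numbers are assumptions chosen only to make the check runnable, and derivatives are taken by central finite differences rather than by Savitzky-Golay filtering.

```python
import math

SCALES = [0.5, 1.0, 2.0]   # hypothetical error scale for each precursor value
WEIGHTS = [0.5, 0.3, 0.2]  # hypothetical probability of each precursor value

def p_cond(eta, k):
    """Toy CPDF p(chi(eta) = 1 | v_k): exponential tail of the error size."""
    return math.exp(-eta / SCALES[k])

def p_marg(eta):
    """Toy climatological event rate p(chi(eta) = 1)."""
    return sum(w * p_cond(eta, k) for k, w in enumerate(WEIGHTS))

def ddeta(f, eta, h=1e-5):
    """Central finite difference with respect to the event magnitude."""
    return (f(eta + h) - f(eta - h)) / (2.0 * h)

def likelihood_ratio(eta, k):
    """Likelihood ratio of Eq. (11), rewritten with Bayes' theorem."""
    pc, p = p_cond(eta, k), p_marg(eta)
    return pc / (1.0 - pc) * (1.0 - p) / p

def condition(eta, k):
    """c(eta, v_k) following Eq. (12), with numerical derivatives."""
    pc, p = p_cond(eta, k), p_marg(eta)
    dlog_pc = ddeta(lambda e: math.log(p_cond(e, k)), eta)
    dlog_p = ddeta(lambda e: math.log(p_marg(e)), eta)
    return dlog_pc - (1.0 - pc) / (1.0 - p) * dlog_p

for k in range(len(SCALES)):
    c = condition(1.0, k)
    slope = ddeta(lambda e: likelihood_ratio(e, k), 1.0)
    print(k, c, slope, (c > 0) == (slope > 0))
```

In this toy setting the sign of the condition differs between precursor values, which mirrors the remark that the magnitude dependence can vary from point to point in the $\eta$-$v_n$ plane.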

7 Conclusions 

We use information obtained from the ensemble forecast and the high resolution 
forecast to successfully predict errors of the high resolution forecast. The output 
of the ensemble forecast system is post-processed in terms of a decision making 
approach. More precisely, the number of ensemble members that display a large 
difference to the corresponding high resolution forecast can serve to predict the 
events we are interested in. The quality of the predictions was evaluated using 
the Brier score, the ignorance score, and ROC-curves. Comparing two different 
strategies of decision making, we find no significant difference in the success 
of the counting method and the numerically more expensive, but theoretically 
better justified maximum CPDF approach. This is surprising, since the counting
method consists simply in imposing a threshold on the number of ensemble
members that show large deviations from the high resolution forecast. However,
the similarity of the results can be understood if we consider the fact that the
CPDF is a monotonically increasing function and consequently, the counting
approach mimics the choices of suitable values of the precursory variable that
would have been chosen according to the maximum CPDF approach.

Figure 13: The lower figure shows the condition $c(\eta, v_n)$, averaged over
all investigated values of $v_n$. The upper graph displays the condition
$c(\eta, v_n)$, averaged over all investigated values of $\eta$.
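The argument that a monotonic CPDF makes both strategies agree can be illustrated with a small example. The sketch assumes that the maximum CPDF approach amounts to issuing a warning whenever the estimated CPDF exceeds a probability threshold; the CPDF table, the counts, and the function names are hypothetical.

```python
def alarms_by_count(counts, min_count):
    """Counting method: warn when enough ensemble members deviate strongly."""
    return [c >= min_count for c in counts]

def alarms_by_cpdf(counts, cpdf, min_prob):
    """CPDF-based method: warn when the estimated event probability is high."""
    return [cpdf[c] >= min_prob for c in counts]

# Hypothetical, strictly increasing CPDF estimate, indexed by the number of
# ensemble members that deviate strongly from the high resolution forecast.
cpdf = [0.02, 0.05, 0.10, 0.30, 0.60]
counts = [0, 3, 1, 4, 2, 2]

# Every count threshold has a matching probability threshold, and both
# rules then issue exactly the same warnings.
same = all(
    alarms_by_count(counts, k) == alarms_by_cpdf(counts, cpdf, cpdf[k])
    for k in range(len(cpdf))
)
print(same)
```

If the CPDF were not monotonic, a probability threshold could select a disconnected set of counts, and the two rules would no longer coincide.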

Additionally, all methods used to evaluate the quality of the predictions
(ROC-curves, the relative Brier score and the relative ignorance score) display
an increase in the quality of the predictions with increasing event magnitude.
This increase is particularly apparent in regimes of the event magnitude that
are well supported by the number of observed events. This positive magnitude
dependence of the ROC-curves could also be reproduced through a test
condition for the magnitude dependence of ROC-curve and likelihood ratio.